Abstract


Marginalising out uncertain quantities within the internal representations or parameters
of neural networks is of central importance for a wide range of learning techniques, such
as empirical, variational or full Bayesian methods. We set out to generalise fast dropout
(Wang & Manning, 2013) to cover a wider variety of noise processes in neural networks.
This leads to an efficient calculation of the marginal likelihood and predictive
distribution which evades sampling and the consequential increase in training time due to
high-variance gradient estimates. This allows us to approximate variational Bayes for the
parameters of feed-forward neural networks. Inspired by the minimum description length
principle, we also propose and experimentally verify the direct optimisation of the
regularised predictive distribution. The methods yield results competitive with previous
neural network based approaches and Gaussian processes on a wide range of regression tasks.

1 Introduction 

Deep learning methods have started to become practical for a wide range of tasks where very many
labeled examples for supervised training are available, especially in the domains of sensory
processing (e.g., vision or audio tasks). Yet, methods which work well on data sets with few
training cases in the context of regression of continuous quantities remain scarce. Frequentist
schemes such as weight decay or heuristics such as dropout have so far not been able to deliver
significant improvements over methods not stemming from the connectionist paradigm, such as
Gaussian processes or random forests; consequently, deep learning methods are generally not
considered in fields where learning should be realised on small data sets.

We consider neural networks with parameters $\theta$, i.e. weights and biases. If we treat the
parameters $\theta$ not as points, but summarise our belief about them via a distribution
$q(\theta)$, the data is explained by marginalising out that distribution, i.e.

$$p(\mathcal{D}) = \int_\theta p(\mathcal{D} \mid \theta)\, q(\theta)\, d\theta. \tag{1}$$

In the case of $p(\theta) = q(\theta)$, i.e. $q$ is a prior, this is commonly referred to as the
marginal likelihood. We will consider supervised data only, that is
$\mathcal{D}_{\text{train}} = \{({}^{i}x, {}^{i}z)\}$, where the functional relationship
$x \to z$ is of interest. In Bayesian learning, a prior $p(\theta)$ is used for $q$ to obtain a
posterior via Bayes' rule


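$$p(\theta \mid \mathcal{D}_{\text{train}}) = \frac{p(\theta)\, p(\mathcal{D}_{\text{train}} \mid \theta)}{p(\mathcal{D}_{\text{train}})}, \tag{2}$$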
which can then be used to form a predictive distribution for unseen data points

$$p(z \mid x, \mathcal{D}_{\text{train}}) = \int_\theta p(z \mid x, \theta)\, p(\theta \mid \mathcal{D}_{\text{train}})\, d\theta. \tag{3}$$


In practice, Bayesian models are designed in a hierarchical way, where the prior is specified with
the help of an additional hyperprior, i.e., $p(\theta) = p(\theta \mid \eta)\, p(\eta)$.

In all but the most trivial cases, Bayesian learning comes with several difficulties which require
approximations. For neural networks, not only will the posterior
$p(\theta \mid \mathcal{D}_{\text{train}})$ be highly multimodal due to symmetries in the weight
space, but it will also be intractable to find the normalisation constant
$p(\mathcal{D}_{\text{train}}) = \int p(\mathcal{D}_{\text{train}} \mid \theta)\, p(\theta)\, d\theta$,
i.e., the marginal likelihood.


Due to this intractability for all but the simplest cases, neural network practitioners have to
resort to approximation schemes such as sampling via Markov chain Monte Carlo (Neal, 1993),
variational inference (Hinton & Van Camp, 1993), combinations thereof (Graves, 2011), or Gaussian
approximations (MacKay, 1992).


The contributions of this work will be as follows. We will extend the idea of fast dropout
(Wang & Manning, 2013) to the marginalisation of distributions over the weights of a neural
network in Section 3 and introduce an efficient way to respect the correlations between output
units in Section 3.1. This will be used to perform variational Bayes for the special case of a
Gaussian likelihood function in Section 3.3. In Section 3.4 we will then propose a novel method to
find a distribution over weights, namely the minimisation of the negative log-likelihood of the
data plus a regularisation term. The proposed methods will be verified experimentally in
Section 4. We model the distributions over weights using diagonal Gaussian as well as Bernoulli
distributions.


1.1 Related Work 


The idea to treat weights in a neural network in a stochastic way, i.e. impose a distribution on
them, goes back at least to Buntine & Weigend (1991). Albeit dated, MacKay (1995) is an excellent
survey article on probabilistically motivated approaches to neural networks, containing many
concepts and ideas from the literature. Using sampling-based techniques, Graves (2011) develops a
practical algorithm based on Variational Inference (VI). More recently, Hernández-Lobato & Adams
(2015) developed a method to treat units in neural networks in terms of their first two moments;
they also develop a novel Bayesian learning algorithm for such scenarios. Most relevant to this
section are the results from Wang & Manning (2013); in fact, their work served as a starting
point for this paper. Even more recently, Blundell et al. (2015) used stochastic weights and an
objective function using the variational free energy. They used the reparametrisation trick from
Kingma & Welling (2013) to backpropagate through the sampling process itself. Closest to our
work is the independently developed method by Kingma et al. (2015), who use fast dropout-like
calculations to reduce the sampling effort and variance of the gradient estimators.


2 Variance Propagation 

2.1 Propagation of Variance through a Transformation 

We use variance propagation to compute the effect of marginalising out the weight distribution.
This variance propagation is based on the work of Wang & Manning (2013), where it was shown for
the case of $\tilde{x} = x \cdot m$, where $m \sim \mathcal{B}(d)$ follows a Bernoulli
distribution with rate $d$. Here, $\tilde{x}$ is the input to the model corrupted by "dropout"
noise.

Wang & Manning (2013) provide a set of rules for propagation of mean and variance through a
network. Rules for multiplication and addition are defined by elementary facts of probability
(Grimmett & Stirzaker, 1992).

2.1.1 Propagation of Variance through a Linear Transformation 

We have a linear transformation $a = w^T x + b$. If $w$, $x$ and $b$ are independent of each
other, have sufficiently many components and finite mean and variance, the central limit theorem
(Grimmett & Stirzaker, 1992) applies. This makes $a$ distributed approximately according to a
Gaussian, i.e. $a \sim \mathcal{N}(\mathbb{E}[a], \mathbb{V}[a])$. More specifically, consider a
distribution $q(\theta)$ over the parameters of the model with $\theta = \{w, b\}$.

We obtain an approximation of the marginal likelihood (cf. Equation (1)):

$$p(a \mid x) = \int_\theta q(\theta)\, p(a \mid x, \theta)\, d\theta \approx \mathcal{N}(\mathbb{E}[a], \mathbb{V}[a]). \tag{4}$$


All that is left to determine is then the expectation and variance of $a$. Since both are sums
and/or products of quantities with known expectation and variance, the calculations are given by

$$\mathbb{E}[a] = \mathbb{E}[w]^T\, \mathbb{E}[x] + \mathbb{E}[b], \tag{5}$$

$$\mathbb{V}[a] = \mathbb{V}[b] + \mathbb{V}[w]^T\, \mathbb{E}[x]^2 + \mathbb{V}[x]^T\, \mathbb{E}[w]^2 + \mathbb{V}[x]^T\, \mathbb{V}[w], \tag{6}$$

where squares are taken element-wise and we have assumed once again that all components of $x$,
$b$ and $w$ are independent.
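As a minimal sketch of the linear propagation rules in Equations (5) and (6), the following Python snippet computes the mean and variance of a single pre-activation from per-component means and variances. The function name `linear_moments` and the example numbers are illustrative assumptions and do not come from the original work.

```python
def linear_moments(mean_w, var_w, mean_x, var_x, mean_b=0.0, var_b=0.0):
    """Mean and variance of a = w^T x + b, assuming all components of
    w, x and b are mutually independent (Equations 5 and 6)."""
    mean_a = mean_b
    var_a = var_b
    for mw, vw, mx, vx in zip(mean_w, var_w, mean_x, var_x):
        mean_a += mw * mx
        # Variance of a product of two independent random variables.
        var_a += vw * mx ** 2 + vx * mw ** 2 + vx * vw
    return mean_a, var_a


# Hypothetical two-dimensional example.
mean_a, var_a = linear_moments(
    mean_w=[0.5, -1.0], var_w=[0.1, 0.2],
    mean_x=[2.0, 1.0], var_x=[0.0, 0.3],
    mean_b=0.1, var_b=0.05,
)
```

For a layer with several output units, the same function would be applied once per column of the weight matrix.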


2.1.2 Propagation of Variance through a Non-linear Function 

While the propagation through transfer functions is not in general tractable, the fact that the
integral is one-dimensional allows for a wide range of approximations. The most straightforward
is the use of a table. Other options include Monte Carlo integration and the unscented transform
(Julier & Uhlmann, 1997). For the rectifier transfer and the logistic sigmoid function, a closed
form and a very good approximation are available, respectively. We present one of them here for
the sake of completeness, but refer the interested reader to the corresponding paper by
Wang & Manning (2013) for derivations.

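In the case of the rectifier $f(a) = \max(a, 0) = y$, we have:

$$\mathbb{E}[y] = \Phi(r)\, \mathbb{E}[a] + \sqrt{\mathbb{V}[a]}\, \phi(r),$$

$$\mathbb{V}[y] = \mathbb{E}[a] \sqrt{\mathbb{V}[a]}\, \phi(r) + \left(\mathbb{E}[a]^2 + \mathbb{V}[a]\right) \Phi(r) - \mathbb{E}[y]^2,$$

where $r = \mathbb{E}[a] / \sqrt{\mathbb{V}[a]}$ and $\Phi$ and $\phi$ are the cumulative
distribution function and probability density function of the standard Normal, respectively.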
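The rectifier moments can be sketched in a few lines of Python. The snippet assumes a Gaussian pre-activation with strictly positive variance and takes $r$ to be the mean divided by the standard deviation of $a$; the function names are illustrative assumptions.

```python
import math


def std_normal_pdf(r):
    """Density of the standard Normal at r."""
    return math.exp(-0.5 * r * r) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(r):
    """Cumulative distribution function of the standard Normal at r."""
    return 0.5 * (1.0 + math.erf(r / math.sqrt(2.0)))


def rectifier_moments(mean_a, var_a):
    """Mean and variance of y = max(a, 0) for a ~ N(mean_a, var_a), var_a > 0."""
    s = math.sqrt(var_a)
    r = mean_a / s
    cdf, pdf = std_normal_cdf(r), std_normal_pdf(r)
    mean_y = cdf * mean_a + s * pdf
    second_moment = mean_a * s * pdf + (mean_a ** 2 + var_a) * cdf
    return mean_y, second_moment - mean_y ** 2


# A standard Normal input: half of the mass is clipped to zero.
mean_y, var_y = rectifier_moments(0.0, 1.0)
```

For a pre-activation far in the positive regime the rectifier acts like the identity, so the output moments approach the input moments.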

We want to stress the fact that propagating a through the transfer function by integrating over each 
of its components separately will introduce the assumption that all elements of a are statistically 
independent, which is certainly not completely justified. 


2.2 Variance Propagation for Deep Networks 

In the previous section we described how to obtain the output expectation and variance of linear and 
non-linear transformations given the expectations and variances of its inputs. Deep networks can be 
constructed by stacking many of these on top of each other. We apply these methods to multilayer 
perceptron networks with additional noise processes affecting the weights of the network. 

It should be noted that all operations are differentiable and thus gradient-based optimisation can
be used. However, the equations are rather complex and use of an automatic differentiation tool
such as Theano (Bergstra et al., 2010) is advisable.


2.3 Noise Processes 

We now consider that the quantities $\tilde{w}$, $\tilde{x}$, $\tilde{b}$ are corrupted versions
of the true underlying quantities $w$, $x$, $b$. We will focus on $x$ first, while the discussion
is equivalent for $w$ and $b$. We define a noise process to be a probability distribution over
possible corruptions given a clean input, i.e. $c(\tilde{x} \mid x)$. If we can obtain
$\mathbb{E}[\tilde{x}]$ and $\mathbb{V}[\tilde{x}]$ given $\mathbb{E}[x]$, $\mathbb{V}[x]$ and
$c$, we can integrate $c$ seamlessly into the calculations.

Since we already gave the respective rules above, two obvious choices are additive and
multiplicative noise. Given a vector of independent noise variables $\epsilon$ with known
expectation and variance, let $\tilde{x} = x + \epsilon$, then

$$\mathbb{E}[\tilde{x}] = \mathbb{E}[x] + \mathbb{E}[\epsilon],$$

$$\mathbb{V}[\tilde{x}] = \mathbb{V}[x] + \mathbb{V}[\epsilon].$$

Analogously, if $\tilde{x} = x \cdot \epsilon$,

$$\mathbb{E}[\tilde{x}] = \mathbb{E}[x] \cdot \mathbb{E}[\epsilon],$$

$$\mathbb{V}[\tilde{x}] = \mathbb{E}[x]^2\, \mathbb{V}[\epsilon] + \mathbb{V}[x]\, \mathbb{E}[\epsilon]^2 + \mathbb{V}[x]\, \mathbb{V}[\epsilon].$$

Depending on the exact nature of $\epsilon$, several noise injecting regularisers can be
approximated, such as Dropout (Hinton et al., 2012) (as done by Wang & Manning (2013)),
DropConnect (Wan et al., 2013) or Gaussian weight noise (Graves, 2011).
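The two noise rules can be illustrated with a short Python sketch for a single scalar component. The function names are illustrative assumptions, and the dropout example reads the Bernoulli rate $d$ as the probability that the mask equals one, which gives a mask mean of $d$ and variance of $d(1 - d)$.

```python
def additive_noise_moments(mean_x, var_x, mean_e, var_e):
    """Moments of x + e for independent x and e."""
    return mean_x + mean_e, var_x + var_e


def multiplicative_noise_moments(mean_x, var_x, mean_e, var_e):
    """Moments of x * e for independent x and e."""
    mean = mean_x * mean_e
    var = mean_x ** 2 * var_e + var_x * mean_e ** 2 + var_x * var_e
    return mean, var


# Dropout as multiplicative Bernoulli noise with rate d (assumed keep probability).
d = 0.8
drop_mean, drop_var = multiplicative_noise_moments(2.0, 0.5, d, d * (1.0 - d))

# Zero-mean Gaussian noise with variance 0.1 as additive noise.
add_mean, add_var = additive_noise_moments(2.0, 0.5, 0.0, 0.1)
```

Because both rules only consume and produce means and variances, they can be chained with the linear and rectifier rules from Section 2.1.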


2.4 Soundness of the Approximation 

Wang & Manning (2013) verified experimentally that the central limit theorem holds for deep neural
networks in certain cases. The Gaussian approximation is, however, not guaranteed to hold in
general and might fail in cases where inputs are low-dimensional or sparse. But is this at all
important? Considering that we are only interested in a function approximator, the exact
interpretations of different quantities in the network are unimportant. Loosely speaking, we do
not care whether our model constitutes a good approximation of a corresponding real model, as long
as the model works well enough for the task at hand, as indicated by an estimate of the
generalisation error.


3 Fast Adaptive Weight Noise 


Adaptive weight noise is a practical method to perform Variational Bayes (VB) in neural networks
(Graves, 2011). The method is based on the approach of Hinton & Van Camp (1993), who utilise the
Minimum Description Length (MDL) principle (Rissanen, 1985; Grünwald, 2007) as an inductive bias.


As usual in the Bayesian setting, the parameters of the model under consideration are not found via
point estimates, but represented as a distribution over the weight space. Here, each parameter
$\theta_i$ will be represented by a Gaussian, i.e. $q(\theta_i) = \mathcal{N}(\mu_i, \sigma_i^2)$.

If we are given a likelihood function and we consider $q$ as a variational approximation to the
true posterior over the parameters having seen the data, the training criterion can be derived by
means of VI:

$$\mathcal{L}_{\text{VI}} := -\sum_i \int_\theta q(\theta) \log p({}^{i}z \mid {}^{i}x, \theta)\, d\theta + \mathrm{KL}[q(\theta) \,\|\, p(\theta)] \tag{7}$$

$$= -\sum_i \mathbb{E}\left[\log p({}^{i}z \mid {}^{i}x, \theta)\right]_{\theta \sim q} + \mathrm{KL}[q(\theta) \,\|\, p(\theta)]$$

$$\approx -\sum_i \log p({}^{i}z \mid {}^{i}x, \theta^{s}) + \mathrm{KL}[q(\theta) \,\|\, p(\theta)], \qquad \theta^{s} \sim q(\theta), \tag{8}$$

where the outer sum is over the training samples. The "trick" that Hinton & Van Camp (1993)
introduce is that the prior $p(\theta)$ is not set or further specified by a hyper-prior but
instead learned as any other parameter in the model and thus essentially set by data. The
contribution of Graves (2011) was then to approximate the expectation in Equation (7) by Monte
Carlo sampling with Equation (8).

Here we use the previously introduced techniques to find a closed-form approximation to adaptive
weight noise. Consider a single layer with $\theta = \{w\}$, $y = f(x^T w)$, where we have no
dropout variables and the weights are Gaussian distributed with
$w \sim \mathcal{N}(\mu_w, \sigma_w^2)$, with the covariance being diagonal and organised into a
vector. Again, we assume a Gaussian density for $a = x^T w$. Using the rules from Section 2, we
find that

$$\mathbb{E}[a] = \mathbb{E}[x]^T \mu_w, \tag{9}$$

$$\mathbb{V}[a] = (\sigma_w^2)^T\, \mathbb{E}[x]^2 + \mathbb{V}[x]^T \mu_w^2 + \mathbb{V}[x]^T \sigma_w^2. \tag{10}$$

For a deterministic input, $\mathbb{V}[x] = 0$ and only the first term of Equation (10) remains.


A perspective that we have not taken so far is that this is a convolution of point predictions,
each performed by a slightly different neural network with weights drawn from their respective
distributions. Consider a neural network $f(x, \theta)$ with $\theta = \{\theta_i\}$, where each
$\theta_i$ is a Gaussian distributed random variable with mean $\mu_i$ and variance $\sigma_i^2$.
Let the network represent a distribution $p(y \mid x, \theta)$ for the random variable $y$, which
is the network's output. The output of the network with marginalised weights will be approximated
as such:

$$p(y \mid x) = \int_\theta p(y \mid x, \theta)\, q(\theta)\, d\theta \approx \mathcal{N}(y \mid \mathbb{E}[y], \mathbb{V}[y]), \tag{11}$$

where $q(\theta)$ denotes the joint over all weights and the moments of the Gaussian variable on
the RHS are obtained as in Equations (9) and (10).


3.1 Output covariance 

While above we assumed all covariance matrices to be diagonal, we can easily and efficiently extend
the last layer to explicitly model covariance in the output.

Let $x$ be an input to the last layer and $W$ the weight matrix that maps this input to the output.
Let $w_{*,o}$ and $w_{*,p}$ be two distinct columns of the weight matrix $W$ and
$o = x^T w_{*,o} + b_o$ and $p = x^T w_{*,p} + b_p$ their respective outputs given $x$, with
biases $b_o$ and $b_p$. In this model, we assume that $o$ and $p$ are not independent and need to
extend the equation of variance propagation for addition

$$\mathbb{V}[A + B] = \mathbb{V}[A] + \mathbb{V}[B] + 2\, \mathrm{cov}[A, B] \tag{12}$$

for dependent outputs.

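Plugging $o$ and $p$ into eq. (12), rearranging and using eq. (6), we can derive a simple formula
for the covariance. (Note that since this assumes independence of $w_{*,o}$ and $w_{*,p}$, it does
not hold for the diagonal entries of the covariance matrix.)

$$\begin{aligned}
2\, \mathrm{cov}[o, p] &= \mathbb{V}[o + p] - \mathbb{V}[o] - \mathbb{V}[p] \\
&= \mathbb{V}[x^T w_{*,o} + b_o + x^T w_{*,p} + b_p] - \mathbb{V}[x^T w_{*,o} + b_o] - \mathbb{V}[x^T w_{*,p} + b_p] \\
&= \mathbb{V}[x^T (w_{*,o} + w_{*,p})] + \mathbb{V}[b_o] + \mathbb{V}[b_p] \\
&\quad - \left(\mathbb{V}[x^T w_{*,o}] + \mathbb{V}[b_o]\right) - \left(\mathbb{V}[x^T w_{*,p}] + \mathbb{V}[b_p]\right) \\
&= \mathbb{V}[x^T (w_{*,o} + w_{*,p})] - \mathbb{V}[x^T w_{*,o}] - \mathbb{V}[x^T w_{*,p}].
\end{aligned} \tag{13}$$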


By applying rules for variance propagation from Section 2.1 and rearranging we arrive at:

$$\mathrm{cov}[o, p] = \mathbb{V}[x]^T \left(\mu_{w_{*,o}} \circ \mu_{w_{*,p}}\right),$$

where $\circ$ denotes the Hadamard product.
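The covariance formula can be checked numerically against the variance identity it is derived from. The sketch below is illustrative only: the function names and all numbers are assumptions, and biases are omitted because their variances cancel in the derivation.

```python
def linear_variance(mean_w, var_w, mean_x, var_x):
    """Variance of x^T w for independent components (no bias term)."""
    return sum(vw * mx ** 2 + vx * mw ** 2 + vx * vw
               for mw, vw, mx, vx in zip(mean_w, var_w, mean_x, var_x))


def output_covariance(var_x, mean_w_o, mean_w_p):
    """cov[o, p] = V[x]^T (mu_o * mu_p) for two distinct output units."""
    return sum(vx * mo * mp for vx, mo, mp in zip(var_x, mean_w_o, mean_w_p))


mean_x, var_x = [1.0, -2.0], [0.5, 0.25]
mean_w_o, var_w_o = [0.3, 0.7], [0.1, 0.2]
mean_w_p, var_w_p = [-0.4, 0.5], [0.05, 0.3]

cov_direct = output_covariance(var_x, mean_w_o, mean_w_p)

# Same quantity via 2 cov[o, p] = V[o + p] - V[o] - V[p]; the two weight
# columns are independent, so their means and variances simply add.
var_sum = linear_variance([a + b for a, b in zip(mean_w_o, mean_w_p)],
                          [a + b for a, b in zip(var_w_o, var_w_p)],
                          mean_x, var_x)
var_o = linear_variance(mean_w_o, var_w_o, mean_x, var_x)
var_p = linear_variance(mean_w_p, var_w_p, mean_x, var_x)
cov_from_variances = 0.5 * (var_sum - var_o - var_p)
```

Both routes give the same value, which reflects that only the input variance and the weight means contribute to the off-diagonal entries.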

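We can show that the diagonal entries of the covariance matrix are computed in the same way as the
variances of diagonal-covariance Fast Adaptive Weight Noise (FAWN). For the "additional" terms
on the diagonal, we define

$$V = \mathrm{diag}\left(\sigma_b^2 + (\sigma_W^2)^T\, \mathbb{E}[x]^2 + (\sigma_W^2)^T\, \mathbb{V}[x]\right),$$

where $\sigma_b^2$ and $\sigma_W^2$ denote the element-wise variances of the biases and weights,
and can then write the covariance matrix for full-covariance FAWN (Co-FAWN) in matrix notation:

$$C = V + \sum_i C_i = V + \sum_i \mathbb{V}[x_i]\, \mu_{w_{i,*}} \mu_{w_{i,*}}^T,$$

where $w_{i,*}$ is the $i$-th row of $W$.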
3.1.1 Computational efficiency


For matrices which are updated by adding the outer product of two vectors, the Sherman-Morrison
formula,

$$(A + u v^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}, \tag{14}$$

presents a means of updating the inverse. Similarly, the determinant of such a matrix can be
updated using the matrix determinant lemma:

$$\det(A + u v^T) = \left(1 + v^T A^{-1} u\right) \det(A). \tag{15}$$

We define

$$A_{i+1} = A_i + u_i v_i^T$$

and can now recursively compute the determinant and inverse of $C$, which are needed to compute
the loss, by setting $A_0 = V$, for which inversion and determinant computations are cheap,
$u_i = \mathbb{V}[x_i]\, \mathbb{E}[w_{i,*}]$ and $v_i = \mathbb{E}[w_{i,*}]$ and repeatedly
using eqs. (14) and (15) until we get the precision matrix $A_n^{-1} = C^{-1}$ and the
determinant $\det(A_n) = \det(C)$.

The depth of the recursion corresponds to the number of hidden units in the last hidden layer $n$.
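The recursion can be sketched in plain Python as follows. This is an illustrative implementation under the assumption that the diagonal of $V$ is strictly positive; the function name `cofawn_inverse_and_det` and the example values are hypothetical.

```python
def cofawn_inverse_and_det(diag_v, var_x, mean_W):
    """Return (C^{-1}, det C) for C = V + sum_i var_x[i] * w_i w_i^T.

    diag_v: diagonal of V (one entry per output unit).
    var_x:  variances of the inputs to the last layer.
    mean_W: weight means, one row per input, one column per output.
    """
    m = len(diag_v)
    # A_0 = V is diagonal, so its inverse and determinant are cheap.
    inv = [[1.0 / diag_v[j] if j == k else 0.0 for k in range(m)] for j in range(m)]
    det = 1.0
    for entry in diag_v:
        det *= entry
    for vx, row in zip(var_x, mean_W):
        u = [vx * w for w in row]
        v = row
        inv_u = [sum(inv[j][k] * u[k] for k in range(m)) for j in range(m)]
        vt_inv = [sum(v[j] * inv[j][k] for j in range(m)) for k in range(m)]
        denom = 1.0 + sum(vj * iu for vj, iu in zip(v, inv_u))
        det *= denom  # rank-one determinant update
        inv = [[inv[j][k] - inv_u[j] * vt_inv[k] / denom for k in range(m)]
               for j in range(m)]  # Sherman-Morrison update
    return inv, det


# Two outputs, two inputs to the last layer.
inv_C, det_C = cofawn_inverse_and_det(
    diag_v=[2.0, 3.0], var_x=[0.5, 0.25], mean_W=[[1.0, 2.0], [-1.0, 0.5]])
```

Each step costs a number of operations quadratic in the number of outputs, so the full recursion avoids a generic matrix inversion.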


3.2 Binary Weights 

In cases where memory and computational resources are limited, one can use Bernoulli distributed
weights instead of Normal distributions. This will halve the number of parameters needed. When
using Bernoulli-distributed weights the same variance propagation rules as for Gaussian distributed
weights apply. The only difference is in the mean and variance of the weight noise process:

$$\mathbb{E}[w] = (p - 0.5)\, s, \tag{16}$$

$$\mathbb{V}[w] = p (1 - p)\, s^2, \tag{17}$$

with $s$ as an additional weight scaling parameter and $p$ as the parameter defining a Bernoulli
distribution. This parameter $s$ helps the network to learn a richer set of functions since the
weights would otherwise be limited to values between zero and one. We compared the results against
regular FAWN as shown in Table 1 as FAWN-BERN.

3.2.1 Justification by Sampling 

We compared the empirical distribution of outputs from the binary weights network with the variance
propagation estimation. Sampling from the output of a Bernoulli-distributed weights network is
done by sampling weight matrices from the distribution of the weights $w' \sim \mathcal{B}(1, p)$
and scaling them with the parameters $s$ through $w = (w' - 0.5)\, s$. These sampled weight
matrices are then used in a standard neural network to produce a sample from
$p(z \mid x, \theta)$. Histograms of these sampled outputs showed no significant deviation from
the variance propagation approximation.
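The sampling procedure for a single weight can be sketched as below and compared with the analytic mean $(p - 0.5)s$ and variance $p(1 - p)s^2$. The helper names, the seed and the values of $p$ and $s$ are assumptions chosen for illustration; the check covers only one weight, not the output histograms of a full network.

```python
import random


def bernoulli_weight_moments(p, s):
    """Analytic mean and variance of w = (w' - 0.5) * s with w' ~ B(1, p)."""
    return (p - 0.5) * s, p * (1.0 - p) * s ** 2


def sample_binary_weight(p, s, rng):
    """Draw w' from a Bernoulli with parameter p, then shift and scale it."""
    w_prime = 1.0 if rng.random() < p else 0.0
    return (w_prime - 0.5) * s


rng = random.Random(0)  # fixed seed for reproducibility
p, s = 0.3, 2.0
samples = [sample_binary_weight(p, s, rng) for _ in range(100000)]

emp_mean = sum(samples) / len(samples)
emp_var = sum((w - emp_mean) ** 2 for w in samples) / len(samples)
mean_w, var_w = bernoulli_weight_moments(p, s)
```

With this many samples the empirical moments should lie close to the analytic ones, which are the only quantities the variance propagation rules need.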


3.3 Fast Variational Inference for Gaussian Likelihoods 

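We will now use variance propagation to obtain an approximation to the first term of
$\mathcal{L}_{\text{VI}}$ for the special case of a Gaussian likelihood.

Consider the first term of the RHS of Equation (7) for the case that $z$ is assumed to be a
univariate Gaussian. We will thus write $z$ for the targets and $y$ for the output of the network
and leave out the dependency on $\theta$ for brevity. Then,

$$\begin{aligned}
\mathbb{E}[\log p(z \mid y)] &= \mathbb{E}\left[\log \mathcal{N}(z \mid y, \sigma^2)\right] \\
&= \mathbb{E}\left[-\frac{(z - y)^2}{2\sigma^2} - \log \sqrt{2\pi}\sigma\right] \\
&= -\frac{\mathbb{E}\left[(z - y)^2\right]}{2\sigma^2} - \log \sqrt{2\pi}\sigma \\
&= \log \mathcal{N}\left(\sqrt{\mathbb{V}[y]} \mid 0, \sigma^2\right) + \log \mathcal{N}\left(z \mid \mathbb{E}[y], \sigma^2\right) + \log \sqrt{2\pi}\sigma,
\end{aligned}$$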

where we have made use of the identity $\mathbb{V}[y] = \mathbb{E}[y^2] - \mathbb{E}[y]^2$. The
last line offers a partially probabilistic interpretation of this specific instance of variational
inference. It puts a zero-centred prior on the square root of the output's variance and on the
error, sharing the same (prior) variance, which is itself encouraged to be large. The last term
can be seen as a measure against the variance collapsing to zero, which would lead to large
likelihoods on the training set. We refer to this method as FAWN-VI.
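The decomposition can be verified numerically with a small sketch. The snippet compares the direct expression, which uses the expected squared error written as squared bias plus output variance, with the three-term form of the last line. Function names and example values are assumptions.

```python
import math


def log_normal_pdf(value, mean, sigma):
    """Log density of a univariate Gaussian with standard deviation sigma."""
    return (-((value - mean) ** 2) / (2.0 * sigma ** 2)
            - math.log(math.sqrt(2.0 * math.pi) * sigma))


def expected_log_lik(z, mean_y, var_y, sigma):
    """E[log N(z | y, sigma^2)] using E[(z - y)^2] = (z - E[y])^2 + V[y]."""
    return (-((z - mean_y) ** 2 + var_y) / (2.0 * sigma ** 2)
            - math.log(math.sqrt(2.0 * math.pi) * sigma))


def expected_log_lik_decomposed(z, mean_y, var_y, sigma):
    """The same quantity written as the three terms of the last line."""
    return (log_normal_pdf(math.sqrt(var_y), 0.0, sigma)
            + log_normal_pdf(z, mean_y, sigma)
            + math.log(math.sqrt(2.0 * math.pi) * sigma))


direct = expected_log_lik(1.0, 0.2, 0.5, 0.7)
decomposed = expected_log_lik_decomposed(1.0, 0.2, 0.5, 0.7)
```

Both expressions agree up to floating point error, so either form can serve as the data term of the FAWN-VI objective.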



3.4 Optimisation of the predictive distribution with regularisation 

Since we now have an efficient approximation of the predictive distribution (cf. Equation (3) and
Equation (11)), an obvious next step is to directly optimise it with respect to the parameter
distributions $q(\theta)$. This will essentially lead to a maximum likelihood approach and thus
inherit its tendency to overfit the training data. Accounting for that is possible by a fully
Bayesian treatment, which means to impose a hyperprior on $q(\theta)$ and integrate it out.

Here we shall follow a different route, which is to make use of a regulariser, namely the
KL-divergence between $q(\theta)$ and a prior $p(\theta)$:

$$\mathcal{L}_{\text{FAWN}} := -\sum_i \log \int_\theta q(\theta)\, p({}^{i}z \mid {}^{i}x, \theta)\, d\theta + \mathrm{KL}[q(\theta) \,\|\, p(\theta)],$$

where the sum runs over the training samples
$\mathcal{D}_{\text{train}} = \{({}^{i}x, {}^{i}z)\}$. We refer to this method as FAWN-ROPD.
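To illustrate how the two objectives differ, the sketch below compares their per-sample data terms for a scalar target. It assumes a Gaussian network output with propagated mean and variance and a Gaussian likelihood with standard deviation `sigma`, so that the marginalised predictive density is a Gaussian whose variance is the sum of the two. This closed form and the function names are assumptions of the sketch; the KL term is identical in both objectives and is left out.

```python
import math


def ropd_data_term(z, mean_y, var_y, sigma):
    """-log of the marginalised predictive density N(z | E[y], V[y] + sigma^2)."""
    total_var = var_y + sigma ** 2
    return (0.5 * math.log(2.0 * math.pi * total_var)
            + (z - mean_y) ** 2 / (2.0 * total_var))


def vi_data_term(z, mean_y, var_y, sigma):
    """-E[log N(z | y, sigma^2)] under the same Gaussian assumption on y."""
    return (0.5 * math.log(2.0 * math.pi * sigma ** 2)
            + ((z - mean_y) ** 2 + var_y) / (2.0 * sigma ** 2))


ropd = ropd_data_term(1.0, 0.2, 0.5, 0.3)
vi = vi_data_term(1.0, 0.2, 0.5, 0.3)
```

By Jensen's inequality the first term never exceeds the second, and the two coincide when the output variance is zero.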


4 Experiments 


We evaluated FAWN-VI and FAWN-ROPD from Sections 3.3 and 3.4 respectively on a range of
static regression tasks using Feed-Forward Neural Networks (FNNs). We are interested in finding
not only a point prediction but a whole predictive distribution. These tasks are typically not where
neural networks excel and practitioners resort to Gaussian Processes (GPs) instead, which is why
we compare to those.


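To this end we used a global univariate Gaussian for the prior and a Gaussian as a variational
approximation for each of the parameters:

$$p(\theta) = \prod_i \mathcal{N}(\theta_i \mid \mu_p, \sigma_p^2),$$

$$q(\theta) = \prod_i \mathcal{N}(\theta_i \mid \mu_i, \sigma_i^2).$$

The KL-divergence is then given by¹:

$$\mathrm{KL}[q(\theta) \,\|\, p(\theta)] = \sum_i \left( \log \frac{\sigma_p}{\sigma_i} + \frac{\sigma_i^2 + (\mu_i - \mu_p)^2}{2\sigma_p^2} - \frac{1}{2} \right).$$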
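The regulariser can be sketched with the standard closed form for the KL-divergence between univariate Gaussians, summed over all parameters. The function name and arguments are illustrative assumptions; standard deviations are passed rather than variances.

```python
import math


def kl_to_global_prior(means, stds, prior_mean, prior_std):
    """Sum over parameters of KL[N(mu_i, std_i^2) || N(prior_mean, prior_std^2)]."""
    total = 0.0
    for mu, std in zip(means, stds):
        total += (math.log(prior_std / std)
                  + (std ** 2 + (mu - prior_mean) ** 2) / (2.0 * prior_std ** 2)
                  - 0.5)
    return total


# A posterior identical to the prior incurs no cost.
kl_same = kl_to_global_prior([0.0, 0.0], [1.0, 1.0], 0.0, 1.0)

# Shifting one mean by one prior standard deviation costs 0.5 nats.
kl_shifted = kl_to_global_prior([1.0], [1.0], 0.0, 1.0)
```

When the prior is learned along with the other parameters, as in adaptive weight noise, `prior_mean` and `prior_std` would simply be additional optimisation variables.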


¹ Obtained with the help of the Q&A community "crossvalidated" at
http://stats.stackexchange.com/questions/7440/kl-divergence-between-two-univariate-gaussians
Table 1: Results for FAWN. Entries are negative log-likelihoods (mean ± standard deviation over
the random splits). Results for probabilistic backpropagation (PBP) and adaptive weight noise (VI)
taken from Hernández-Lobato & Adams (2015). Results for GPs obtained via GPy (GPy authors,
2012-2014), where no results for the slightly bigger data sets (more than 1500 samples) were
obtained due to the increased run time. Best results shown in bold.

Dataset       VI              PBP             GP              FAWN-VI         FAWN-ROPD       FAWN-BERN
Boston        2.903±0.071     2.550±0.089     2.631±0.289     3.005±0.273     2.559±0.161     2.685±0.196
Concrete      3.391±0.017     3.136±0.021     2.893±0.095     3.183±0.077     3.107±0.134     3.310±0.109
Energy        2.391±0.029     1.982±0.027     0.711±1.477     1.762±0.655     1.369±0.842     2.095±0.077
Kin8nm        0.897±0.010     -0.964±0.007    -               -1.006±0.027    -1.211±0.032    -0.601±0.021
Naval         -3.734±0.116    -3.653±0.004    -               -6.751±0.118    -6.837±0.131    -3.608±0.066
Power Plant   2.890±0.010     2.838±0.008     -               2.849±0.042     2.819±0.029     2.859±0.031
Protein       2.992±0.006     2.974±0.002     -               2.973±0.022     2.882±0.068     3.005±0.013
Wine          0.980±0.013     0.966±0.014     -               0.943±0.037     0.908±0.078     0.934±0.085
Yacht         3.439±0.163     1.483±0.018     0.615±0.756     1.448±0.393     0.336±0.271     3.201±0.191
Year          3.622±N/A       3.603±N/A       -               3.807±N/A       3.472±N/A       -


Table 2: Results for FAWN and Co-FAWN used on multi-output datasets. For the Jura dataset we train
on the location coordinates only and predict the local concentrations of the six different
elements. Best results shown in bold.

Dataset   N        D    out   FAWN-ROPD          Co-FAWN
Energy    768      8    2     2.1218±0.7024      2.1063±0.8357
Naval     11'934   16   2     -14.9868±0.7368    -15.1074±0.3656
Sarcos    48'933   21   7     -4.4185±N/A        -5.1867±N/A
Jura      358      2    7     11.1407±N/A        8.6396±N/A

Table 3: Size of Datasets

Dataset       N         D
Boston        506       13
Concrete      1030      8
Energy        768       8
Power Plant   9568      4
Protein       45'730    9
Wine          1599      11
Yacht         308       6
Year          515'345   90


Additionally, we chose a Gaussian likelihood where we assumed that

$$z_i = y_i + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma_i^2),$$

which resembles a Gaussian distributed measurement error with variance $\sigma_i^2$ for output
dimension $i$. We integrate the $\sigma_i$ into the set of parameters and optimise them jointly
with all other parameters.

All experiments were performed using a similar protocol to the one used in
Hernández-Lobato & Adams (2015): we used single-layer networks with 50 hidden units using the
rectifier transfer function. We report the negative log likelihood of the data with means and
standard deviations coming from ten different random splits into 90% training and 10% testing
data. The parameters of neural networks using FAWN were initialised by drawing from a zero-centred
Gaussian with standard deviation 0.2.

Training was performed using Adam (Kingma & Ba, 2014) with a step rate of $\alpha = 0.001$ until
convergence of the training loss. No separate validation set was used. Gradients were estimated
using 128 samples in a single mini batch.

The results for GPs were obtained using the sum of a linear and a squared exponential kernel
using automatic relevance determination. Three random restarts were performed. We used GPy
(GPy authors, 2012-2014) for the experiments.

The results are summarised in Table 1. The proposed methods place themselves well among
alternative approaches, where FAWN-ROPD is better than FAWN-VI in all cases.

5 Conclusion and Future Work 

We have proposed a method to approximate the marginal likelihood of a distribution over neural
network weights up to its mean and variance. This enabled us to derive a deterministic
approximation of variational Bayes for Gaussian likelihoods and propose a novel, less subjective
flavour of variational inference, FAWN-ROPD. The experimental results show that FAWN-ROPD obtains
competitive performance over a wide range of regression tasks. These tasks include ones with very
few samples (order of a few hundred) as well as many samples (several thousands) and come from
domains such as robotics, predictive maintenance, computational biology and others.

The method requires further evaluation: we will experimentally investigate more common
deep-learning architectures such as recurrent neural networks and deep multilayer perceptrons.
Further, the suitability of FAWN for tasks where model uncertainty in the predictions is of
interest, such as active learning or reinforcement learning, needs to be tested. On the
theoretical side, the exact relationship of FAWN-ROPD to reference priors remains unclear and a
theoretically founded motivation for FAWN-ROPD is an important next step.